link to powerpoint slides: A power point presentation will describe the general principles of single-cell ‘omics’ and basics about the platforms (~15 - 20 minutes)
This workshop provides an overview of key quality control (QC) metrics and processing steps in the initial steps of single-cell transcriptomic data analysis. While detailed tutorials for processing FASTQ files, performing genome alignments, and deconvoluting cells are available through various resources, we’ll focus on understanding the fundamental QC metrics and data structures.
Links to alignment preliminary analysis tutorials:
Vendor-specific pipeline resources (e.g., 10X Genomics)
Google colab hands-on experience (under construction)
Upon receiving aligned single-cell/nuclei sequencing data, one of the initial plots to examine is the ‘knee plot,’ commonly generated during analysis pipelines. This plot helps distinguish between barcodes representing cells and those representing background noise or empty Gel Emulsion beads (GEMs).
Gel Emulsion beads (GEMs) are microfluidic droplets used in platforms like 10X Genomics, each potentially containing a cell or nuclei. Distinguishing between GEMs with cells and empty GEMs is crucial for accurate analysis.
The knee plot visually depicts how a threshold is set to differentiate between cell barcodes and background noise. In platforms like 10X Genomics, cells within GEMs produce a larger number of unique mRNA molecules (UMIs) compared to empty GEMs. The knee plot typically shows a sudden drop (knee) in UMIs, indicating the separation between GEMs containing cells and those lacking them.
Interpreting the knee plot involves identifying this knee point. A distinct knee suggests successful separation, while the absence of a clear knee could indicate issues such as low RNA integrity or high ambient RNA contamination.
In preparation for the following hands-on experiences we will set up an R environment which has everything we need.
suppressPackageStartupMessages({
library(Seurat)
library(ggplot2)
library(tidyr)
})
You should have already downloaded the dataset in a previous exercise and placed it in the downloads folder. If not, download it from here.
seurat_object <- readRDS('./downloads/pbmc3k_final.rds')
# It will give you an error message, this is OK, we fix it here
seurat_object <- UpdateSeuratObject(seurat_object)
Validating object structure
Updating object slots
Ensuring keys are in the proper structure
Updating matrix keys for DimReduc ‘pca’
Updating matrix keys for DimReduc ‘umap’
Warning: Assay RNA changing from Assay to AssayWarning: Graph RNA_nn changing from Graph to GraphWarning: Graph RNA_snn changing from Graph to GraphWarning: DimReduc pca changing from DimReduc to DimReducWarning: DimReduc umap changing from DimReduc to DimReducEnsuring keys are in the proper structure
Ensuring feature names don't have underscores or pipes
Updating slots in RNA
Updating slots in RNA_nn
Setting default assay of RNA_nn to RNA
Updating slots in RNA_snn
Setting default assay of RNA_snn to RNA
Updating slots in pca
Updating slots in umap
Setting umap DimReduc to global
Setting assay used for NormalizeData.RNA to RNA
Setting assay used for FindVariableFeatures.RNA to RNA
Setting assay used for ScaleData.RNA to RNA
Setting assay used for RunPCA.RNA to RNA
Setting assay used for JackStraw.RNA.pca to RNA
No assay information could be found for ScoreJackStraw
Warning: Adding a command log without an assay associated with itSetting assay used for FindNeighbors.RNA.pca to RNA
No assay information could be found for FindClusters
Warning: Adding a command log without an assay associated with itSetting assay used for RunUMAP.RNA.pca to RNA
Validating object structure for Assay ‘RNA’
Validating object structure for Graph ‘RNA_nn’
Validating object structure for Graph ‘RNA_snn’
Validating object structure for DimReduc ‘pca’
Validating object structure for DimReduc ‘umap’
Object representation is consistent with the most current Seurat version
Idents(seurat_object) = 'orig.ident'
print(seurat_object)
An object of class Seurat
13714 features across 2638 samples within 1 assay
Active assay: RNA (13714 features, 2000 variable features)
3 layers present: counts, data, scale.data
2 dimensional reductions calculated: pca, umap
To better understand how to work with single-cell/nuclei RNA sequencing data we will take a deep dive into how datasets are typically structured. In the diagram below we illustrate how droplet-based scRNAseq approaches separate cells and their mRNAs. But how does this translate into data?
# Raw counts in a seurat object is saved in the assays-RNA-counts slot
raw_counts = seurat_object@assays$RNA@counts
# Lets look at the first 4 columns and rows
print(raw_counts[1:4,1:4])
4 x 4 sparse Matrix of class "dgCMatrix"
AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG
AL627309.1 . . . .
AP006222.2 . . . .
RP11-206L10.2 . . . .
RP11-206L10.9 . . . .
What does this mean?
Columns: The columns represent a unique
barcode. As we can see below, this barcode
marks an individual cell which is how we can identify each one. So in
this object, the columns are the cells
Rows: We see that the first few rows have names such as RP11-206L10.9. A quick google search will tell you that this is a human gene (Ensembl ID - ENSG00000237491). So we see in this object that the rows represent the genes.
Single-cell transcriptomic analysis tools (e.g., Seurat) does a lot of the heavy lifting for us when it comes to counts and other summary data information. For example, the following code will tell you the number of genes and cells, as well store raw, normalized, and scaled data, dimension reductions, and other analyses which may have been performed.
print(seurat_object)
An object of class Seurat
13714 features across 2638 samples within 1 assay
Active assay: RNA (13714 features, 2000 variable features)
3 layers present: counts, data, scale.data
2 dimensional reductions calculated: pca, umap
This indicates that we have 13,714 genes (features) across 2,638 samples.
However, we can also extract that information from the count table:
print(paste0('Cells: ', ncol(raw_counts)))
[1] "Cells: 2638"
print(paste0('Features:', nrow(raw_counts)))
[1] "Features:13714"
These numbers should match!
# Calculate the total reads in each individual cell
read_depths = colSums(raw_counts)
# How are they distributed?
hist(read_depths, main = "Histogram of read depth", xlab = "Depth", ylab = "Frequency")
This histogram shows that the number of mRNA transcripts detected in each cell ranged from N to N, but that most were around 2,000. So what might account for these differences?
Cell types: Some cell types may simply just express more mRNA than others. For example, Hepatocytes are known to contain a lot of RNA while others may not.
Stochastic gene expression: Fluctuations in transcription and degradation from normal biological processes, cell states, and other factors in individual cells which have very low amounts of RNA to begin can appear as big differences.
Technical variability: Sequencing depth, assay kits, and other technical factors may contribute to these differences.
The zero-inflation question: In the early days of scRNAseq it was a widely accepted that data was zero-inflated, i.e., it had more zero’s than it should because of technical challenges. Today, we better understand that the large number of zero’s are not only technical, but in fact does reflect true biology as well. Here are some interesting papers on the topic: Jiang et al., 2022
Any analysis, no matter of size, should undergo rigorous QC. When it comes to single-cell transcriptomic data, there are several computational tools that help you in the process from enabling visualizations to performing a panel of QC tests and reporting several different metrics for the end user to evaluate. A list of just some of these tools can be found here.
https://www.scrna-tools.org/
https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html
Let’s go through some of the most common QC metrics, what to they mean, and why do they matter?
How does the number of detected genes and transcripts indicate quality? While we don’t have a known number of expressed genes for each cell type, we do know that cells require the expression of hundreds to thousands of genes to function properly.
Number of genes (features)
We can check these values by running the code below:
VlnPlot(seurat_object, features = 'nFeature_RNA') + NoLegend()
Here we can see that (1) the number of genes detected ranges from ~200 - 2500 and (2) most express ~900 genes. Is this normal and expected? In reality, there is no universal threshold, these values will depend on the cell type and the sequencing depth. But we do know that if most cells had extemely low counts that either cells were not healthy/intact or sequencing depth was much too low. It could also point to errors further upstream such as alignment to the incorrect reference genome.
Number of mRNA molecules (count)
We can also examine the total number of mRNA molecules detected:
VlnPlot(seurat_object, features = 'nCount_RNA')
As above, we can see a good distribution centered around 2500 mRNA molecules. These value should always be higher than the number of features.
Correlation of features and molecules
A third way to look at this data is to examine the correlation of features and mRNA molecules:
FeatureScatter(object = seurat_object, feature1 = 'nCount_RNA', feature2 = 'nFeature_RNA')
This metric is more informative because it tells you that as you detect more mRNA molecules, you also detect more different genes. This would generally be expected as you are getting a larger sampling of the mRNA resulting in an increased chance of sampling different genes. However, when you assess these plots, you will want to keep the biology in mind.
Here, biological knowledge is extremely important for evaluating this metric and removing cells based on a threshold. Consider these three experiments:
Nuclei isolated from muscle cells which require a lot of energy
Muscle whole cells which require a lot of energy
Skin cells which are less metabolically active
What would you expect to see (qualitative) in the amount of mitochondrial genes for each of these experiments?
Assessing percent mitochondria counts
We can assess the percentage of mitochondrial genes by calculating the number of counts coming from mitochondrial genes (usually prefixed with mt- or MT-) compared to the total counts. This can be visualized like this:
VlnPlot(seurat_object, features = 'percent.mt')
Note that this dataset was cleaned before downloading so the percentage of mitochondrial genes is extremely low (maximum 5%). It is typical to throw out cells with higher percentage of mitochondrial genes with the assumption that they are unhealthy. However, think back to the three experiments and consider what you would expect to see:
Nuclei isolated from muscle cells which require a lot of energy
Despite being metabolically active, nuclei isolation is expected to wash out mitochondria. High contamination is indicative of high ambient RNA or mitochondria sticking to the the nuclei.
Muscle whole cells which require a lot of energy
Given that muscle cells are highly metabolically active and are known to have among the most mitochondria among mammalian cells, these data may show higher percentage of mitochondrial genes than most datasets.
Skin cells which are less metabolically active
Given that these are less metabolically active, they may be expected to show similar or lower percentage of mitochondrial genes.
Correlation of counts and mitochondrial counts
As with the features and counts, we can look at how these correlate:
FeatureScatter(seurat_object, feature1 = 'nCount_RNA', feature2 = 'percent.mt')
Interestingly, these are not correlated at all. Prior to filtering, these values will be anti-correlated as mitochondria often makes up the ambient RNA (release from dead or lysed cells) and will be more represented in the background (low count cells).
{r} seurat_object <- NormalizeData(seurat_object, normalization.method = "LogNormalize" ) normalized_data = seurat_object@assays$RNA@data # Calculate the total reads in each individual cell read_depths = colSums(normalized_data) # How are they distributed? hist(read_depths, main = "Histogram of read depth", xlab = "Depth", ylab = "Frequency")}
What does a gene look like?
{r} hist(normalized_data['S100A10', ], main = "Histogram of read depth", xlab = "Depth", ylab = "Frequency", breaks = 200)}
{r} hist(normalized_data[,'CATTGTACTTTGCT'], main = "Histogram of read depth", xlab = "Depth", ylab = "Frequency", breaks = 200)}
6. Dimension reduction and clustering; 7. Annotation of cell types; 8. Trajectories and RNA velocity; 9. Inferring cell-cell communication; 10. Spatial transcriptomics